Topic: Forest Fires in the Northeast Region of Portugal

Overview and Motivation

  • Provide an overview of the project goals and the motivation for it.

Forest fires are among the most common ecological disasters in the world. In the US, the record-breaking forest fires in California have caused great concern. Although forest fires play an important role in the natural cycle, they pose devastating threats to local communities and biodiversity. Negative effects of forest fires include pollution, deforestation, and health impacts, as well as the economic cost of fire suppression. To better predict forest fires, fire modeling has been used to characterize fire behavior and improve the safety of firefighters. We would therefore like to identify significant variables associated with forest fires and develop a simple model to help understand their patterns.

Initial Questions:

  • What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?

The first question we would like to answer is: what are the influential factors in predicting forest fires? Initially, we planned to explore the data set with visualization and then use regression analysis to investigate significant variables. After learning about Shiny apps, we decided to incorporate one into the visualization section to better illustrate the time-dependent distribution. As the course progressed, we were exposed to various machine learning methods and formed a more concrete idea about using modeling to predict forest fires.

Data: Source, scraping method, cleanup, etc.

The forest fire data we use come from Montesinho Natural Park in the northeast region of Portugal, collected from January 2000 to December 2003, with a total of 517 entries (http://archive.ics.uci.edu/ml/datasets/Forest+Fires). Features include spatial information on a 9x9 grid map, temporal information, components of the Fire Weather Index (FWI), and meteorological information.

For temporal information, month and day were selected to account for both natural factors and human causes; four FWI components—FFMC, DMC, DC, and ISI—were included; meteorological information includes temperature, relative humidity, wind speed, and outside rain. The outcome is the burned area of the forest.

# Libraries used throughout
library(ggplot2)  # plots
library(olsrr)    # influence diagnostics
library(broom)    # tidy model output
library(knitr)    # kable tables
library(mice)     # md.pattern

# Add season category
fire <- read.csv("forestfires.csv", header = TRUE, sep = ",")
fire$season <- rep("spring", nrow(fire))
fire$season[fire$month %in% c("dec", "jan", "feb")] <- "winter"
fire$season[fire$month %in% c("sep", "oct", "nov")] <- "autumn"
fire$season[fire$month %in% c("jun", "jul", "aug")] <- "summer"
fire$season <- as.factor(fire$season)

# Numeric season code: spring = 0, summer = 1, autumn = 2, winter = 3
fire$season.cat <- rep(0, nrow(fire))
fire$season.cat[fire$season == "summer"] <- 1
fire$season.cat[fire$season == "autumn"] <- 2
fire$season.cat[fire$season == "winter"] <- 3
head(fire)
##   X Y month day FFMC  DMC    DC  ISI temp RH wind rain area season season.cat
## 1 7 5   mar fri 86.2 26.2  94.3  5.1  8.2 51  6.7  0.0    0 spring          0
## 2 7 4   oct tue 90.6 35.4 669.1  6.7 18.0 33  0.9  0.0    0 autumn          2
## 3 7 4   oct sat 90.6 43.7 686.9  6.7 14.6 33  1.3  0.0    0 autumn          2
## 4 8 6   mar fri 91.7 33.3  77.5  9.0  8.3 97  4.0  0.2    0 spring          0
## 5 8 6   mar sun 89.3 51.3 102.2  9.6 11.4 99  1.8  0.0    0 spring          0
## 6 8 6   aug sun 92.3 85.3 488.0 14.7 22.2 29  5.4  0.0    0 summer          1

Our dataset was ready for analysis, so no additional scraping was needed. For cleanup, we focused on identifying outliers and potential influential points that might affect our regression model fitting.

Since one of the models we fit is a linear regression, which assumes normality, and our outcome (burned area) is not normally distributed, we applied a log transformation to the burned areas above 0 to approximate a normal distribution. Next, we identified outliers and potential influential points in our dataset based on leverage and Cook's distance.

# Area log transformation (for area>0)
hist(fire$area,40, main = "Histogram of area", xlab = "Area")
fire["logarea"] <- ifelse(fire$area >0, log(fire$area), NA)
ggplot(data=fire, aes(x=logarea))+
  geom_histogram(aes(y=..density..), col="black",fill="white")+
  stat_function(fun=dnorm, args = list(mean=mean(fire$logarea, na.rm = TRUE), sd = sd(fire$logarea, na.rm=TRUE)),col="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 247 rows containing non-finite values (stat_bin).

# Outliers / Cook's Distance
area_posit <- fire[which(fire$area>0),]
mod_lin <- lm(logarea ~X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, data=area_posit[c(seq(1,12),15,16)])
ols_plot_cooksd_bar(mod_lin)  # observation 262 in this subset corresponds to row id 500 in the full data
ols_plot_resid_lev(mod_lin)
ols_plot_dffits(mod_lin)

cooksd <- cooks.distance(mod_lin)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")  # plot cook's distance
abline(h = 4*mean(cooksd, na.rm=T), col="red")  # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd+0.001, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red")

# Remove the influential points
influential <- which(cooksd > 4*mean(cooksd, na.rm=T))  # influential points
influ <- area_posit[influential, ]   # data for the influential points
influ
##     X Y month day FFMC   DMC    DC  ISI temp RH wind rain    area season
## 139 9 9   jul tue 85.8  48.3 313.4  3.9 18.0 42  2.7  0.0    0.36 summer
## 144 1 2   jul sat 90.0  51.3 296.3  8.7 16.6 53  5.4  0.0    0.71 summer
## 148 8 3   sep tue 84.4  73.4 671.9  3.2 24.2 28  3.6  0.0    0.96 autumn
## 200 2 4   sep mon 63.5  70.8 665.3  0.8 22.6 38  3.6  0.0   11.32 autumn
## 239 6 5   sep sat 92.5 121.1 674.4  8.6 25.1 27  4.0  0.0 1090.84 autumn
## 267 6 5   aug tue 94.3 131.7 607.1 22.7 19.4 55  4.0  0.0    0.17 summer
## 416 8 6   aug thu 94.8 222.4 698.6 13.9 27.5 27  4.9  0.0  746.28 summer
## 421 8 8   aug wed 91.7 191.4 635.9  7.8 26.2 36  4.5  0.0  185.76 summer
## 470 6 3   apr sun 91.0  14.6  25.6 12.3 13.7 33  9.4  0.0   61.13 spring
## 474 9 4   jun sat 90.5  61.1 252.6  9.4 24.5 50  3.1  0.0   70.32 summer
## 480 7 4   jul mon 89.2 103.9 431.6  6.4 22.6 57  4.9  0.0  278.53 summer
## 500 7 5   aug tue 96.1 181.1 671.2 14.3 27.3 63  4.9  6.4   10.82 summer
## 514 2 4   aug sun 81.6  56.7 665.6  1.9 21.9 71  5.8  0.0   54.29 summer
##     season.cat     logarea
## 139          1 -1.02165125
## 144          1 -0.34249031
## 148          2 -0.04082199
## 200          2  2.42657107
## 239          2  6.99470332
## 267          1 -1.77195684
## 416          1  6.61510086
## 421          1  5.22445552
## 470          0  4.11300274
## 474          1  4.25305625
## 480          1  5.62952577
## 500          1  2.38139627
## 514          1  3.99434005
dat <- area_posit[-influential,]  # drop the influential points
head(dat)
##     X Y month day FFMC   DMC    DC  ISI temp RH wind rain area season
## 140 1 4   sep tue 91.0 129.5 692.6  7.0 21.7 38  2.2    0 0.43 autumn
## 141 2 5   sep mon 90.9 126.5 686.5  7.0 21.9 39  1.8    0 0.47 autumn
## 142 1 2   aug wed 95.5  99.9 513.3 13.2 23.3 31  4.5    0 0.55 summer
## 143 8 6   aug fri 90.1 108.0 529.8 12.5 21.2 51  8.9    0 0.61 summer
## 145 2 5   aug wed 95.5  99.9 513.3 13.2 23.8 32  5.4    0 0.77 summer
## 146 6 5   aug thu 95.2 131.7 578.8 10.4 27.4 22  4.0    0 0.90 summer
##     season.cat    logarea
## 140          2 -0.8439701
## 141          2 -0.7550226
## 142          1 -0.5978370
## 143          1 -0.4942963
## 145          1 -0.2613648
## 146          1 -0.1053605

We created a new data set excluding these points and wanted to see if fitting our regression models without these influential points helped improve our model fitting.

# New Dataset
new_dat <- rbind(fire[which(fire$area==0),], dat)   # join the area > 0 rows (w/o influential points) to the rows with area = 0
new_dat$burn <- ifelse(new_dat$area==0, 0, 1)  # binary outcome: 1 if any area burned
head(new_dat)
##   X Y month day FFMC  DMC    DC  ISI temp RH wind rain area season season.cat
## 1 7 5   mar fri 86.2 26.2  94.3  5.1  8.2 51  6.7  0.0    0 spring          0
## 2 7 4   oct tue 90.6 35.4 669.1  6.7 18.0 33  0.9  0.0    0 autumn          2
## 3 7 4   oct sat 90.6 43.7 686.9  6.7 14.6 33  1.3  0.0    0 autumn          2
## 4 8 6   mar fri 91.7 33.3  77.5  9.0  8.3 97  4.0  0.2    0 spring          0
## 5 8 6   mar sun 89.3 51.3 102.2  9.6 11.4 99  1.8  0.0    0 spring          0
## 6 8 6   aug sun 92.3 85.3 488.0 14.7 22.2 29  5.4  0.0    0 summer          1
##   logarea burn
## 1      NA    0
## 2      NA    0
## 3      NA    0
## 4      NA    0
## 5      NA    0
## 6      NA    0
md.pattern(new_dat)  # logarea is NA by construction when area = 0; no other missing values

##     X Y month day FFMC DMC DC ISI temp RH wind rain area season season.cat burn
## 257 1 1     1   1    1   1  1   1    1  1    1    1    1      1          1    1
## 247 1 1     1   1    1   1  1   1    1  1    1    1    1      1          1    1
##     0 0     0   0    0   0  0   0    0  0    0    0    0      0          0    0
##     logarea    
## 257       1   0
## 247       0   1
##         247 247
summary(new_dat)
##        X               Y            month               day           
##  Min.   :1.000   Min.   :2.000   Length:504         Length:504        
##  1st Qu.:3.000   1st Qu.:4.000   Class :character   Class :character  
##  Median :4.000   Median :4.000   Mode  :character   Mode  :character  
##  Mean   :4.633   Mean   :4.288                                        
##  3rd Qu.:7.000   3rd Qu.:5.000                                        
##  Max.   :9.000   Max.   :9.000                                        
##                                                                       
##       FFMC            DMC               DC             ISI        
##  Min.   :18.70   Min.   :  1.10   Min.   :  7.9   Min.   : 0.000  
##  1st Qu.:90.28   1st Qu.: 70.53   1st Qu.:441.8   1st Qu.: 6.500  
##  Median :91.60   Median :108.30   Median :664.2   Median : 8.400  
##  Mean   :90.71   Mean   :111.10   Mean   :549.0   Mean   : 9.028  
##  3rd Qu.:92.90   3rd Qu.:142.40   3rd Qu.:714.3   3rd Qu.:10.725  
##  Max.   :96.20   Max.   :291.30   Max.   :860.6   Max.   :56.100  
##                                                                   
##       temp             RH              wind            rain         
##  Min.   : 2.20   Min.   : 15.00   Min.   :0.400   Min.   :0.000000  
##  1st Qu.:15.40   1st Qu.: 32.75   1st Qu.:2.700   1st Qu.:0.000000  
##  Median :19.20   Median : 41.50   Median :4.000   Median :0.000000  
##  Mean   :18.80   Mean   : 44.28   Mean   :4.001   Mean   :0.009524  
##  3rd Qu.:22.73   3rd Qu.: 53.00   3rd Qu.:4.900   3rd Qu.:0.000000  
##  Max.   :33.30   Max.   :100.00   Max.   :9.400   Max.   :1.400000  
##                                                                     
##       area            season      season.cat       logarea       
##  Min.   :  0.000   autumn:185   Min.   :0.000   Min.   :-2.4080  
##  1st Qu.:  0.000   spring: 64   1st Qu.:1.000   1st Qu.: 0.7608  
##  Median :  0.420   summer:224   Median :1.000   Median : 1.8083  
##  Mean   :  8.196   winter: 31   Mean   :1.363   Mean   : 1.7885  
##  3rd Qu.:  6.365                3rd Qu.:2.000   3rd Qu.: 2.6596  
##  Max.   :212.880                Max.   :3.000   Max.   : 5.3607  
##                                                 NA's   :247      
##       burn       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.5099  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
## 

Exploratory Analysis:

  • What visualizations did you use to look at your data in different ways? What are the different statistical methods you considered? Justify the decisions you made, and show any major changes to your ideas. How did you reach these conclusions?

We used an R Shiny app to visualize our data and get a rough idea of the frequency, size, and distribution of forest fires. We were interested in the weather observations associated with the FWI index: temperature, relative humidity, wind speed, and rain. For each of these variables, a plot describes its relationship with forest fires by month. For a more direct interpretation, we also plotted the count and area of forest fires by the variables of interest. The Shiny app is in the GitHub repository and is also shown on the Google site. The following pictures are screenshots of the Shiny app.

[Screenshots of the Shiny app visualization]

For statistical methods, we used linear and logistic regression models based on our data. Since the outcome, burned area, is continuous and not a count, we considered a linear regression model; we then categorized the outcome as binary—burn = 1 if burned area > 0, burn = 0 if burned area = 0—and fit a logistic regression model. To compare these models, we mainly looked at \(R^2\), adjusted \(R^2\), and the square root of the MSE (mean squared error) for linear regression; AUC and the goodness-of-fit (GOF) test for logistic regression; and AIC and BIC for all models.
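As a minimal, self-contained sketch of the logistic-regression metrics we compare, the snippet below fits a binomial GLM on simulated data (not our fire dataset) and computes AIC, BIC, and AUC; the rank-based AUC formula is an assumption standing in for a package such as pROC:

```r
set.seed(1)
n <- 200
x <- rnorm(n)
p <- plogis(-0.5 + 1.2 * x)          # true underlying probabilities
burn <- rbinom(n, 1, p)              # simulated binary outcome, like our `burn`
sim <- data.frame(x = x, burn = burn)

mod_log <- glm(burn ~ x, data = sim, family = binomial)

# AIC and BIC, comparable across candidate models
aic <- AIC(mod_log)
bic <- BIC(mod_log)

# AUC via the rank-sum (Mann-Whitney) identity, with no extra packages
pred <- predict(mod_log, type = "response")
r <- rank(pred)
n1 <- sum(burn == 1); n0 <- sum(burn == 0)
auc <- (sum(r[burn == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
c(AIC = aic, BIC = bic, AUC = auc)
```

On our data, the same recipe applies with `mod_log` replaced by a model of `burn` on the fire-weather predictors.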

After learning about machine learning, we selected important variables from the logistic regression models and used them to predict whether a forest fire will occur. We mainly focused on accuracy, specificity, sensitivity, and AUC to compare machine learning models.
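These classification metrics come straight from a confusion matrix; a toy example (hypothetical probabilities and labels, with an assumed 0.5 cutoff, not our fitted models):

```r
# Toy predicted probabilities and true labels
truth <- c(1, 1, 1, 0, 0, 0, 1, 0)
prob  <- c(0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4)
pred  <- as.integer(prob > 0.5)            # classify at the 0.5 cutoff

tp <- sum(pred == 1 & truth == 1)          # true positives
tn <- sum(pred == 0 & truth == 0)          # true negatives
fp <- sum(pred == 1 & truth == 0)          # false positives
fn <- sum(pred == 0 & truth == 1)          # false negatives

accuracy    <- (tp + tn) / length(truth)   # overall agreement
sensitivity <- tp / (tp + fn)              # true positive rate
specificity <- tn / (tn + fp)              # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
# -> accuracy 0.75, sensitivity 0.75, specificity 0.75
```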

Below is the code for the linear regression, logistic regression, and machine learning models.

Linear Regression Models

# Fit linear regression to area > 0, where log(area) is approximately normal
area_posit <- fire[which(fire$area>0),]
summary(area_posit)
##        X               Y            month               day           
##  Min.   :1.000   Min.   :2.000   Length:270         Length:270        
##  1st Qu.:3.000   1st Qu.:4.000   Class :character   Class :character  
##  Median :5.000   Median :4.000   Mode  :character   Mode  :character  
##  Mean   :4.807   Mean   :4.367                                        
##  3rd Qu.:7.000   3rd Qu.:5.000                                        
##  Max.   :9.000   Max.   :9.000                                        
##       FFMC            DMC              DC             ISI        
##  Min.   :63.50   Min.   :  3.2   Min.   : 15.3   Min.   : 0.800  
##  1st Qu.:90.33   1st Qu.: 82.9   1st Qu.:486.5   1st Qu.: 6.800  
##  Median :91.70   Median :111.7   Median :665.6   Median : 8.400  
##  Mean   :91.03   Mean   :114.7   Mean   :570.9   Mean   : 9.177  
##  3rd Qu.:92.97   3rd Qu.:141.3   3rd Qu.:721.3   3rd Qu.:11.375  
##  Max.   :96.20   Max.   :291.3   Max.   :860.6   Max.   :22.700  
##       temp             RH             wind            rain        
##  Min.   : 2.20   Min.   :15.00   Min.   :0.400   Min.   :0.00000  
##  1st Qu.:16.12   1st Qu.:33.00   1st Qu.:2.700   1st Qu.:0.00000  
##  Median :20.10   Median :41.00   Median :4.000   Median :0.00000  
##  Mean   :19.31   Mean   :43.73   Mean   :4.113   Mean   :0.02889  
##  3rd Qu.:23.40   3rd Qu.:53.00   3rd Qu.:4.900   3rd Qu.:0.00000  
##  Max.   :33.30   Max.   :96.00   Max.   :9.400   Max.   :6.40000  
##       area            season      season.cat      logarea       
##  Min.   :   0.09   autumn:102   Min.   :0.00   Min.   :-2.4079  
##  1st Qu.:   2.14   spring: 24   1st Qu.:1.00   1st Qu.: 0.7608  
##  Median :   6.37   summer:125   Median :1.00   Median : 1.8516  
##  Mean   :  24.60   winter: 19   Mean   :1.43   Mean   : 1.8448  
##  3rd Qu.:  15.42                3rd Qu.:2.00   3rd Qu.: 2.7358  
##  Max.   :1090.84                Max.   :3.00   Max.   : 6.9947
head(area_posit)
##     X Y month day FFMC   DMC    DC  ISI temp RH wind rain area season
## 139 9 9   jul tue 85.8  48.3 313.4  3.9 18.0 42  2.7    0 0.36 summer
## 140 1 4   sep tue 91.0 129.5 692.6  7.0 21.7 38  2.2    0 0.43 autumn
## 141 2 5   sep mon 90.9 126.5 686.5  7.0 21.9 39  1.8    0 0.47 autumn
## 142 1 2   aug wed 95.5  99.9 513.3 13.2 23.3 31  4.5    0 0.55 summer
## 143 8 6   aug fri 90.1 108.0 529.8 12.5 21.2 51  8.9    0 0.61 summer
## 144 1 2   jul sat 90.0  51.3 296.3  8.7 16.6 53  5.4    0 0.71 summer
##     season.cat    logarea
## 139          1 -1.0216512
## 140          2 -0.8439701
## 141          2 -0.7550226
## 142          1 -0.5978370
## 143          1 -0.4942963
## 144          1 -0.3424903
# Scatterplot with loess curve
ggplot(data=area_posit, aes(x=DMC, y=logarea))+
  geom_point()+
  geom_smooth(method = "loess",se=FALSE)+
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
# Fit linear regression model with full dataset
mod_lin <- lm(logarea ~X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, data=area_posit[c(seq(1,12),15,16)])
summary(mod_lin)  # DMC and DC are statistically significant
## 
## Call:
## lm(formula = logarea ~ X + Y + month + day + FFMC + DMC + DC + 
##     ISI + temp + RH + wind + rain, data = area_posit[c(seq(1, 
##     12), 15, 16)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2131 -0.9790  0.0380  0.8145  4.4916 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.2117947  3.9629695   0.306 0.760033    
## X            0.0515407  0.0473061   1.090 0.277001    
## Y           -0.0944394  0.0960197  -0.984 0.326315    
## monthaug     0.7546454  1.3660544   0.552 0.581161    
## monthdec     1.9052997  1.1644253   1.636 0.103075    
## monthfeb    -0.2741962  0.9130864  -0.300 0.764207    
## monthjul     0.0373724  1.1677970   0.032 0.974496    
## monthjun    -0.4647905  1.0983596  -0.423 0.672545    
## monthmar    -0.4325527  0.8534419  -0.507 0.612730    
## monthmay     1.2928830  1.7376890   0.744 0.457578    
## monthoct     3.2689080  1.6666212   1.961 0.050970 .  
## monthsep     2.0552667  1.5337220   1.340 0.181475    
## daymon      -0.0005960  0.3603204  -0.002 0.998682    
## daysat       0.6039482  0.3413894   1.769 0.078128 .  
## daysun       0.4058246  0.3367923   1.205 0.229382    
## daythu       0.2295388  0.3752345   0.612 0.541292    
## daytue       0.3295887  0.3519106   0.937 0.349906    
## daywed       0.0269697  0.3705798   0.073 0.942043    
## FFMC         0.0071617  0.0433667   0.165 0.868969    
## DMC          0.0094934  0.0028344   3.349 0.000938 ***
## DC          -0.0051541  0.0020759  -2.483 0.013706 *  
## ISI         -0.0309518  0.0372786  -0.830 0.407190    
## temp         0.0393276  0.0348704   1.128 0.260504    
## RH          -0.0004741  0.0100132  -0.047 0.962274    
## wind         0.0517875  0.0584917   0.885 0.376822    
## rain         0.0457141  0.2420749   0.189 0.850373    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.499 on 244 degrees of freedom
## Multiple R-squared:  0.1254, Adjusted R-squared:  0.0358 
## F-statistic:   1.4 on 25 and 244 DF,  p-value: 0.1036
tidy(mod_lin)
## # A tibble: 26 x 5
##    term        estimate std.error statistic p.value
##    <chr>          <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)   1.21      3.96      0.306    0.760
##  2 X             0.0515    0.0473    1.09     0.277
##  3 Y            -0.0944    0.0960   -0.984    0.326
##  4 monthaug      0.755     1.37      0.552    0.581
##  5 monthdec      1.91      1.16      1.64     0.103
##  6 monthfeb     -0.274     0.913    -0.300    0.764
##  7 monthjul      0.0374    1.17      0.0320   0.974
##  8 monthjun     -0.465     1.10     -0.423    0.673
##  9 monthmar     -0.433     0.853    -0.507    0.613
## 10 monthmay      1.29      1.74      0.744    0.458
## # … with 16 more rows
#Fit linear regression again, now without the influential points and outliers
mod_lin2 <- lm(logarea ~X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, data=dat)
summary(mod_lin2)
## 
## Call:
## lm(formula = logarea ~ X + Y + month + day + FFMC + DMC + DC + 
##     ISI + temp + RH + wind + rain, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9746 -0.9080 -0.0007  0.7724  3.5148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.987342   4.994567   0.198 0.843467    
## X            0.032411   0.044763   0.724 0.469771    
## Y           -0.101419   0.092957  -1.091 0.276398    
## monthaug     1.313931   1.374839   0.956 0.340224    
## monthdec     2.022285   1.150576   1.758 0.080135 .  
## monthfeb    -0.043227   0.925304  -0.047 0.962780    
## monthjul     0.789941   1.219786   0.648 0.517883    
## monthjun    -0.028542   1.132933  -0.025 0.979922    
## monthmar    -0.119666   0.904959  -0.132 0.894914    
## monthmay     1.731173   1.647245   1.051 0.294379    
## monthoct     3.860256   1.625441   2.375 0.018372 *  
## monthsep     2.582194   1.532596   1.685 0.093368 .  
## daymon      -0.115957   0.333272  -0.348 0.728206    
## daysat       0.525835   0.317920   1.654 0.099489 .  
## daysun       0.440884   0.312954   1.409 0.160244    
## daythu       0.088689   0.347098   0.256 0.798553    
## daytue       0.609855   0.329800   1.849 0.065711 .  
## daywed      -0.085819   0.340491  -0.252 0.801231    
## FFMC         0.020687   0.056889   0.364 0.716462    
## DMC          0.009157   0.002689   3.406 0.000778 ***
## DC          -0.005234   0.001983  -2.640 0.008856 ** 
## ISI         -0.011857   0.037883  -0.313 0.754582    
## temp        -0.011517   0.033153  -0.347 0.728616    
## RH          -0.008238   0.009472  -0.870 0.385330    
## wind         0.002567   0.055097   0.047 0.962878    
## rain         0.098441   1.030367   0.096 0.923969    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.362 on 231 degrees of freedom
## Multiple R-squared:  0.1515, Adjusted R-squared:  0.05972 
## F-statistic:  1.65 on 25 and 231 DF,  p-value: 0.03051
plot(mod_lin2, which=c(1,2,3))
## Warning: not plotting observations with leverage one:
##   232, 254

From the coefficients and p-values of the linear model fitted to the full dataset, we can see that DMC (p-value < 0.001) and DC (p-value < 0.05) are statistically significant.

From the coefficients and p-values of the linear model fitted to the dataset without the potential influential points, DMC and DC remain statistically significant, and 'monthoct' becomes statistically significant as well (p-value < 0.05). This model also has more variables with p-values close to 0.05.

We also checked the LINE assumptions (Linearity, Independence, Normality, Equal variance) for the second linear regression model. In the residuals-vs-fitted plot for model #2, the points spread evenly around the zero line, so the linearity and equal-variance assumptions hold. In the normal Q-Q plot for model #2, the points do not deviate much from the line, so the normality assumption holds.

table_12<- matrix(c(summary(mod_lin)$r.squared, summary(mod_lin)$adj.r.squared, sqrt(mean(mod_lin$residuals^2)),AIC(mod_lin), BIC(mod_lin),
                  summary(mod_lin2)$r.squared, summary(mod_lin2)$adj.r.squared, sqrt(mean(mod_lin2$residuals^2)),AIC(mod_lin2), BIC(mod_lin2)), ncol=5,nrow=2, byrow=TRUE)
colnames(table_12)<- c("R^2", "Adjusted R^2", "Square root of MSE","AIC", "BIC")
rownames(table_12)<- c("logarea ~.",
                     "logarea~., (w/o influential points)")
table_12 <- as.table(table_12)
kable(table_12)
|                                     |       R^2 | Adjusted R^2 | Square root of MSE |       AIC |      BIC |
|-------------------------------------|----------:|-------------:|-------------------:|----------:|---------:|
| logarea ~.                          | 0.1254112 |    0.0358017 |           1.425126 | 1011.5275 | 1108.685 |
| logarea~., (w/o influential points) | 0.1515462 |    0.0597222 |           1.291624 |  914.8673 | 1010.692 |

We compared these two models (#1 & #2) on R^2, adjusted R^2, square root of MSE, AIC, and BIC. We prefer higher values of R^2 and adjusted R^2, and lower values of square root of MSE, AIC, and BIC. Here, the model fitted without the influential points has higher R^2 and adjusted R^2 and lower square root of MSE, AIC, and BIC, indicating that the second model fits the data better. Therefore, in further model fitting, we use the dataset without influential points for better prediction.

Since we have many candidate variables, we also tried linear models with forward selection and backward elimination based on AIC.

# Linear model with forward selection & backward elimination based on AIC
full.model <- lm(logarea ~., data = dat[c(seq(1,12),15,16)])
step.forw <- step(lm(logarea~1, data=dat[c(seq(1,12),15,16)]), ~X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, direction = "forward")
## Start:  AIC=175.77
## logarea ~ 1
## 
##         Df Sum of Sq    RSS    AIC
## + ISI    1    6.1318 499.20 174.63
## + temp   1    4.2722 501.06 175.59
## <none>               505.33 175.77
## + FFMC   1    1.8494 503.48 176.83
## + rain   1    1.0318 504.30 177.24
## + Y      1    0.7418 504.59 177.39
## + RH     1    0.6581 504.67 177.43
## + DC     1    0.6023 504.73 177.46
## + wind   1    0.3736 504.96 177.58
## + X      1    0.1436 505.19 177.69
## + DMC    1    0.0142 505.32 177.76
## + month  9   28.4319 476.90 178.89
## + day    6   16.8713 488.46 179.04
## 
## Step:  AIC=174.63
## logarea ~ ISI
## 
##         Df Sum of Sq    RSS    AIC
## <none>               499.20 174.63
## + RH     1    1.5252 497.68 175.84
## + rain   1    1.2029 498.00 176.01
## + Y      1    1.1403 498.06 176.04
## + DMC    1    0.9407 498.26 176.15
## + temp   1    0.9073 498.29 176.16
## + wind   1    0.6063 498.59 176.32
## + FFMC   1    0.4141 498.79 176.42
## + X      1    0.3244 498.88 176.46
## + DC     1    0.0082 499.19 176.63
## + day    6   16.7117 482.49 177.88
## + month  9   22.3065 476.89 180.88
step.back <- step(full.model, direction = "backward")
## Start:  AIC=183.53
## logarea ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp + 
##     RH + wind + rain + season.cat
## 
## 
## Step:  AIC=183.53
## logarea ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp + 
##     RH + wind + rain
## 
##         Df Sum of Sq    RSS    AIC
## - wind   1     0.004 428.76 181.53
## - rain   1     0.017 428.77 181.54
## - ISI    1     0.182 428.93 181.64
## - temp   1     0.224 428.98 181.67
## - FFMC   1     0.245 429.00 181.68
## - X      1     0.973 429.72 182.12
## - day    6    18.275 447.03 182.26
## - RH     1     1.404 430.16 182.37
## - Y      1     2.209 430.96 182.85
## <none>               428.75 183.53
## - month  9    37.254 466.01 186.95
## - DC     1    12.936 441.69 189.17
## - DMC    1    21.529 450.28 194.12
## 
## Step:  AIC=181.54
## logarea ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp + 
##     RH + rain
## 
##         Df Sum of Sq    RSS    AIC
## - rain   1     0.020 428.78 179.55
## - ISI    1     0.184 428.94 179.65
## - temp   1     0.236 428.99 179.68
## - FFMC   1     0.242 429.00 179.68
## - X      1     0.979 429.73 180.12
## - day    6    18.335 447.09 180.30
## - RH     1     1.412 430.17 180.38
## - Y      1     2.223 430.98 180.86
## <none>               428.76 181.53
## - month  9    37.269 466.02 184.96
## - DC     1    12.932 441.69 187.17
## - DMC    1    21.779 450.53 192.27
## 
## Step:  AIC=179.55
## logarea ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp + 
##     RH
## 
##         Df Sum of Sq    RSS    AIC
## - ISI    1     0.201 428.98 177.67
## - temp   1     0.223 429.00 177.68
## - FFMC   1     0.250 429.03 177.70
## - X      1     0.981 429.76 178.13
## - day    6    18.354 447.13 178.32
## - RH     1     1.395 430.17 178.38
## - Y      1     2.237 431.01 178.88
## <none>               428.78 179.55
## - month  9    37.476 466.25 183.08
## - DC     1    13.013 441.79 185.23
## - DMC    1    21.864 450.64 190.33
## 
## Step:  AIC=177.67
## logarea ~ X + Y + month + day + FFMC + DMC + DC + temp + RH
## 
##         Df Sum of Sq    RSS    AIC
## - FFMC   1     0.083 429.06 175.72
## - temp   1     0.267 429.24 175.83
## - X      1     0.919 429.90 176.22
## - day    6    18.312 447.29 176.41
## - RH     1     1.558 430.53 176.60
## - Y      1     2.120 431.10 176.94
## <none>               428.98 177.67
## - month  9    40.010 468.99 182.59
## - DC     1    12.934 441.91 183.30
## - DMC    1    22.914 451.89 189.04
## 
## Step:  AIC=175.72
## logarea ~ X + Y + month + day + DMC + DC + temp + RH
## 
##         Df Sum of Sq    RSS    AIC
## - temp   1     0.229 429.29 173.85
## - X      1     0.899 429.96 174.26
## - day    6    18.249 447.31 174.42
## - RH     1     1.723 430.78 174.75
## - Y      1     2.156 431.22 175.01
## <none>               429.06 175.72
## - month  9    39.933 468.99 180.59
## - DC     1    13.726 442.79 181.81
## - DMC    1    24.395 453.45 187.93
## 
## Step:  AIC=173.85
## logarea ~ X + Y + month + day + DMC + DC + RH
## 
##         Df Sum of Sq    RSS    AIC
## - day    6    18.022 447.31 172.42
## - X      1     1.008 430.30 172.46
## - RH     1     2.032 431.32 173.07
## - Y      1     2.361 431.65 173.26
## <none>               429.29 173.85
## - DC     1    14.038 443.33 180.12
## - DMC    1    24.169 453.46 185.93
## - month  9    55.884 485.17 187.31
## 
## Step:  AIC=172.42
## logarea ~ X + Y + month + DMC + DC + RH
## 
##         Df Sum of Sq    RSS    AIC
## - X      1     0.996 448.31 171.00
## - RH     1     1.333 448.64 171.19
## - Y      1     2.796 450.11 172.03
## <none>               447.31 172.42
## - DC     1    15.327 462.64 179.08
## - month  9    54.709 502.02 184.08
## - DMC    1    27.839 475.15 185.94
## 
## Step:  AIC=171
## logarea ~ Y + month + DMC + DC + RH
## 
##         Df Sum of Sq    RSS    AIC
## - RH     1     1.083 449.39 169.62
## - Y      1     1.834 450.14 170.04
## <none>               448.31 171.00
## - DC     1    14.482 462.79 177.17
## - month  9    53.721 502.03 182.08
## - DMC    1    26.874 475.18 183.96
## 
## Step:  AIC=169.62
## logarea ~ Y + month + DMC + DC
## 
##         Df Sum of Sq    RSS    AIC
## - Y      1     1.775 451.16 168.63
## <none>               449.39 169.62
## - DC     1    14.348 463.74 175.69
## - month  9    53.707 503.10 180.63
## - DMC    1    26.021 475.41 182.08
## 
## Step:  AIC=168.63
## logarea ~ month + DMC + DC
## 
##         Df Sum of Sq    RSS    AIC
## <none>               451.16 168.63
## - DC     1    13.848 465.01 174.40
## - month  9    52.829 503.99 179.09
## - DMC    1    25.059 476.22 180.52
summary(step.forw)
## 
## Call:
## lm(formula = logarea ~ ISI, data = dat[c(seq(1, 12), 15, 16)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0204 -1.0208 -0.0905  0.8410  3.5457 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.14044    0.21716   9.857   <2e-16 ***
## ISI         -0.03826    0.02162  -1.770    0.078 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.399 on 255 degrees of freedom
## Multiple R-squared:  0.01213,    Adjusted R-squared:  0.00826 
## F-statistic: 3.132 on 1 and 255 DF,  p-value: 0.07795
summary(step.back)
## 
## Call:
## lm(formula = logarea ~ month + DMC + DC, data = dat[c(seq(1, 
##     12), 15, 16)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7872 -0.9334 -0.0507  0.8514  3.6869 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.840299   0.787455   2.337 0.020245 *  
## monthaug     1.563758   1.231400   1.270 0.205324    
## monthdec     2.225481   1.059302   2.101 0.036672 *  
## monthfeb     0.247090   0.893731   0.276 0.782420    
## monthjul     1.113326   1.081491   1.029 0.304290    
## monthjun     0.079900   1.025060   0.078 0.937934    
## monthmar     0.232653   0.844752   0.275 0.783233    
## monthmay     1.958612   1.567324   1.250 0.212618    
## monthoct     3.979868   1.522284   2.614 0.009492 ** 
## monthsep     2.833618   1.406153   2.015 0.044980 *  
## DMC          0.009194   0.002492   3.689 0.000277 ***
## DC          -0.005187   0.001892  -2.742 0.006553 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.357 on 245 degrees of freedom
## Multiple R-squared:  0.1072, Adjusted R-squared:  0.06711 
## F-statistic: 2.674 on 11 and 245 DF,  p-value: 0.00293
plot(step.back, 1:3)
## Warning: not plotting observations with leverage one:
##   232

In the forward selection model, only the ISI predictor was selected, and its p-value (0.078) is not statistically significant at the 0.05 level. In the backward elimination model, more predictors are retained; 'monthdec', 'monthoct', 'monthsep', DMC, and DC are statistically significant with p-values < 0.05.

Finally, we compared the four fitted linear regression models on R^2, adjusted R^2, the square root of the MSE, AIC, and BIC to see which one performs best.

table1<- matrix(c(summary(mod_lin)$r.squared, summary(mod_lin)$adj.r.squared, sqrt(mean(mod_lin$residuals^2)),AIC(mod_lin), BIC(mod_lin),
                  summary(mod_lin2)$r.squared, summary(mod_lin2)$adj.r.squared, sqrt(mean(mod_lin2$residuals^2)),AIC(mod_lin2), BIC(mod_lin2),
                  summary(step.forw)$r.squared, summary(step.forw)$adj.r.squared, sqrt(mean(step.forw$residuals^2)),AIC(step.forw), BIC(step.forw),
                  summary(step.back)$r.squared, summary(step.back)$adj.r.squared, sqrt(mean(step.back$residuals^2)),AIC(step.back), BIC(step.back)), ncol=5,nrow=4, byrow=TRUE)
colnames(table1)<- c("R^2", "Adjusted R^2", "Square root of MSE","AIC", "BIC")
rownames(table1)<- c("logarea ~.",
                     "logarea~., (w/o influential points)",
                     "model with forward selection based on AIC",
                     "model with backward elimination based on AIC")
table1 <- as.table(table1)
kable(table1)
|                                              |       R^2 | Adjusted R^2 | Square root of MSE |       AIC |       BIC |
|----------------------------------------------|-----------|--------------|--------------------|-----------|-----------|
| logarea ~ .                                  | 0.1254112 |    0.0358017 |           1.425126 | 1011.5275 | 1108.6849 |
| logarea ~ . (w/o influential points)         | 0.1515462 |    0.0597222 |           1.291624 |  914.8673 | 1010.6923 |
| model with forward selection based on AIC    | 0.0121343 |    0.0082603 |           1.393706 |  905.9650 |  916.6122 |
| model with backward elimination based on AIC | 0.1071920 |    0.0671067 |           1.324955 |  899.9630 |  946.1009 |

From the summary table above, the forward selection model performs worst. Although the square root of the MSE is slightly higher for the backward elimination model than for the full model, its adjusted R^2 is higher and its AIC and BIC are smaller. Hence, the backward elimination model is the best of these linear regression models. However, judging by its R^2, it still explains only about 10.7% of the variance in log burned area. We therefore also tried a logistic regression model.
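One caveat when reading these AIC values: the AIC printed in the step() trace above (e.g. 168.63 at the final step) and the AIC() values in the table (e.g. 899.96) are on different scales, because step() uses extractAIC(), which drops additive constants from the log-likelihood. The offset depends only on the sample size, so only differences within one scale are meaningful. A self-contained illustration using R's built-in cars data:

```r
# extractAIC() (used by step()) and AIC() differ by an additive constant
# that depends only on n, not on the model being compared
fit <- lm(dist ~ speed, data = cars)   # built-in dataset, n = 50
n   <- nrow(cars)
AIC(fit) - extractAIC(fit)[2]          # the offset between the two scales
n * log(2 * pi) + n + 2                # the same value, in closed form
```

Because the offset is constant for a fixed dataset, model rankings agree across the two scales.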

Logistic Regression Models

We categorized the outcome as binary: burn = 1 if the burned area > 0, and burn = 0 if the burned area = 0.
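As a minimal, self-contained sketch of this binarization (toy area values stand in for the burned-area column of the data):

```r
# burn = 1 when any area burned, 0 otherwise
area <- c(0, 0, 2.5, 0.9, 0)   # toy burned areas (ha) for illustration
burn <- ifelse(area > 0, 1, 0)
burn                           # 0 0 1 1 0
```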

# Fit Logistic Regression
lg_burn <- glm(burn ~ X+Y+month+day+FFMC+DMC+DC+ISI+temp+RH+wind+rain, family=binomial(link = "logit"),data=new_dat)
summary(lg_burn)
## 
## Call:
## glm(formula = burn ~ X + Y + month + day + FFMC + DMC + DC + 
##     ISI + temp + RH + wind + rain, family = binomial(link = "logit"), 
##     data = new_dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.59243  -1.16846   0.00038   1.11144   1.99644  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -7.282e+00  3.485e+00  -2.090   0.0366 *
## X            5.473e-02  4.895e-02   1.118   0.2636  
## Y            4.653e-02  9.349e-02   0.498   0.6187  
## monthaug    -5.892e-01  1.277e+00  -0.461   0.6446  
## monthdec     1.683e+01  7.894e+02   0.021   0.9830  
## monthfeb     6.951e-01  8.730e-01   0.796   0.4259  
## monthjan    -1.446e+01  1.417e+03  -0.010   0.9919  
## monthjul    -5.139e-01  1.129e+00  -0.455   0.6489  
## monthjun    -5.395e-01  1.042e+00  -0.518   0.6047  
## monthmar    -3.265e-01  8.120e-01  -0.402   0.6876  
## monthmay     1.172e-01  1.630e+00   0.072   0.9427  
## monthnov    -1.591e+01  2.400e+03  -0.007   0.9947  
## monthoct    -1.407e+00  1.515e+00  -0.928   0.3532  
## monthsep    -4.984e-01  1.428e+00  -0.349   0.7270  
## daymon       4.548e-02  3.439e-01   0.132   0.8948  
## daysat      -1.840e-02  3.277e-01  -0.056   0.9552  
## daysun      -4.778e-02  3.178e-01  -0.150   0.8805  
## daythu      -5.141e-02  3.596e-01  -0.143   0.8863  
## daytue       2.252e-01  3.563e-01   0.632   0.5272  
## daywed       3.045e-01  3.713e-01   0.820   0.4122  
## FFMC         5.855e-02  3.780e-02   1.549   0.1214  
## DMC         -1.574e-03  2.830e-03  -0.556   0.5781  
## DC           1.338e-03  1.921e-03   0.697   0.4860  
## ISI         -1.922e-02  2.959e-02  -0.649   0.5160  
## temp         4.648e-02  3.448e-02   1.348   0.1777  
## RH           8.667e-03  9.863e-03   0.879   0.3796  
## wind         7.701e-02  5.926e-02   1.300   0.1937  
## rain        -1.824e+00  1.212e+00  -1.505   0.1323  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 698.49  on 503  degrees of freedom
## Residual deviance: 659.75  on 476  degrees of freedom
## AIC: 715.75
## 
## Number of Fisher Scoring iterations: 15
hoslem.test(new_dat$burn, fitted(lg_burn), g=10)  # H0: good fit; a small p-value would indicate a poor fit
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  new_dat$burn, fitted(lg_burn)
## X-squared = 6.6071, df = 8, p-value = 0.5796
gof(lg_burn)  #AUC=64.0%
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##      chiSq df pVal
## PrI      2  2    2
## drI      4  2    4
## PrG      1  1    1
## drG      3  1    3
## PrCT     1  1    1
## drCT     3  1    3
##                  val df pVal
## HL chiSq           9  3    8
## mHL F              7  4    2
## OsRo Z             1  5    9
## SstPgeq0.5 Z       5  5    3
## SstPl0.5 Z         3  5    5
## SstBoth chiSq      6  2    7
## SllPgeq0.5 chiSq   4  1    4
## SllPl0.5 chiSq     2  1    6
## SllBoth chiSq      8  2    1

Given the coefficients of the first logistic regression model, no variable is statistically significant except the intercept; the p-values for temperature and rain are the lowest and closest to the significance level. Because most covariates in our model are continuous, the number of covariate patterns is large, so we used the Hosmer and Lemeshow goodness-of-fit (GOF) test. Given the p-value = 0.5796 > 0.05, we fail to reject the null hypothesis of a good fit; by this test, the logistic regression model appears adequate. We then plotted the Receiver Operating Characteristic (ROC) curve and computed the area under the curve (AUC); the larger the AUC, the better the fit. Here, AUC = 64.0%, which is acceptable. We then tried to improve the fit by adding quadratic terms, chosen based on the significant variables in the linear regression model and the two near-significant variables in the first logistic model.
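The Hosmer and Lemeshow p-value reported above comes from a chi-square distribution with g - 2 = 8 degrees of freedom, and can be reproduced directly from the printed statistic:

```r
# HL test: X-squared = 6.6071 on df = g - 2 = 8 (g = 10 risk groups)
pchisq(6.6071, df = 8, lower.tail = FALSE)   # ~0.5796, as reported
```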

lg_burn2 <- glm(burn ~X+Y+month+day+FFMC+I(FFMC^2)+DMC+I(DMC^2)+DC+I(DC^2)+ISI+temp+I(temp^2)+RH+wind+rain+I(rain^2), family=binomial(link = "logit"), data=new_dat)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lg_burn2)
## 
## Call:
## glm(formula = burn ~ X + Y + month + day + FFMC + I(FFMC^2) + 
##     DMC + I(DMC^2) + DC + I(DC^2) + ISI + temp + I(temp^2) + 
##     RH + wind + rain + I(rain^2), family = binomial(link = "logit"), 
##     data = new_dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.77127  -1.12232   0.00035   1.10094   1.59486  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.672e+01  2.402e+01  -1.112   0.2660  
## X            6.841e-02  4.994e-02   1.370   0.1708  
## Y            5.984e-02  9.478e-02   0.631   0.5278  
## monthaug     3.866e-01  1.668e+00   0.232   0.8167  
## monthdec     1.750e+01  7.858e+02   0.022   0.9822  
## monthfeb     6.733e-01  8.864e-01   0.760   0.4475  
## monthjan    -1.481e+01  1.227e+03  -0.012   0.9904  
## monthjul     5.046e-01  1.550e+00   0.326   0.7448  
## monthjun     1.345e-01  1.369e+00   0.098   0.9217  
## monthmar    -1.803e-01  8.330e-01  -0.216   0.8287  
## monthmay     3.846e-01  1.666e+00   0.231   0.8175  
## monthnov    -1.527e+01  2.400e+03  -0.006   0.9949  
## monthoct     3.664e-01  1.816e+00   0.202   0.8401  
## monthsep     4.587e-01  1.708e+00   0.269   0.7883  
## daymon      -8.509e-02  3.534e-01  -0.241   0.8097  
## daysat       2.603e-02  3.329e-01   0.078   0.9377  
## daysun      -7.932e-02  3.260e-01  -0.243   0.8078  
## daythu      -4.474e-02  3.677e-01  -0.122   0.9032  
## daytue       2.633e-01  3.702e-01   0.711   0.4768  
## daywed       3.617e-01  3.821e-01   0.947   0.3439  
## FFMC         5.689e-01  5.669e-01   1.004   0.3155  
## I(FFMC^2)   -3.156e-03  3.379e-03  -0.934   0.3502  
## DMC          2.154e-02  1.222e-02   1.763   0.0780 .
## I(DMC^2)    -7.148e-05  3.669e-05  -1.948   0.0514 .
## DC          -6.328e-03  5.707e-03  -1.109   0.2675  
## I(DC^2)      6.665e-06  5.196e-06   1.283   0.1996  
## ISI         -1.373e-02  3.212e-02  -0.428   0.6690  
## temp        -1.260e-01  1.145e-01  -1.100   0.2713  
## I(temp^2)    4.519e-03  2.659e-03   1.699   0.0893 .
## RH           9.781e-03  1.041e-02   0.940   0.3475  
## wind         8.665e-02  6.047e-02   1.433   0.1518  
## rain        -9.369e+01  4.306e+03  -0.022   0.9826  
## I(rain^2)    7.591e+01  4.517e+03   0.017   0.9866  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 698.49  on 503  degrees of freedom
## Residual deviance: 641.42  on 471  degrees of freedom
## AIC: 707.42
## 
## Number of Fisher Scoring iterations: 15
hoslem.test(new_dat$burn, fitted(lg_burn2), g=10) #not a poor fit
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  new_dat$burn, fitted(lg_burn2)
## X-squared = 3.6565, df = 8, p-value = 0.8867
gof(lg_burn2) #AUC=66.4%
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

##      chiSq df pVal
## PrI      2  2    2
## drI      4  2    4
## PrG      1  1    1
## drG      3  1    3
## PrCT     1  1    1
## drCT     3  1    3
##                  val df pVal
## HL chiSq           6  3    7
## mHL F              2  4    6
## OsRo Z             5  5    8
## SstPgeq0.5 Z       4  5    2
## SstPl0.5 Z         3  5    5
## SstBoth chiSq      9  2    1
## SllPgeq0.5 chiSq   7  1    3
## SllPl0.5 chiSq     1  1    8
## SllBoth chiSq      8  2    4
anova(lg_burn, lg_burn2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: burn ~ X + Y + month + day + FFMC + DMC + DC + ISI + temp + RH + 
##     wind + rain
## Model 2: burn ~ X + Y + month + day + FFMC + I(FFMC^2) + DMC + I(DMC^2) + 
##     DC + I(DC^2) + ISI + temp + I(temp^2) + RH + wind + rain + 
##     I(rain^2)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       476     659.75                        
## 2       471     641.42  5   18.323 0.002567 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the coefficients, more variables are now close to the significance level (p-value = 0.05), and much closer than in model #1. In this logistic regression model, DMC and temperature appear to be the most influential predictors.

We also applied the Hosmer and Lemeshow GOF test to the new logistic regression model. Since the p-value = 0.8867 > 0.05, we fail to reject the null hypothesis of a good fit; this p-value is also larger than for the first model. So by the Hosmer and Lemeshow GOF test, logistic model #2 likewise appears to fit well. Looking at the ROC curve and AUC, now AUC = 66.4%, larger than the first model's 64.0%, and a larger AUC indicates a better fit. Thus, after adding these terms, the model fits better.

Besides the Hosmer and Lemeshow GOF test and the ROC curve, we further compared the two logistic regression models with a Likelihood Ratio Test (LRT), whose statistic follows a chi-square distribution since the models are nested. H0: the reduced model (logistic #1) is sufficient; H1: the full model (logistic #2) is preferred. Since the p-value = 0.002567 < 0.05, we reject the null hypothesis that the reduced model is sufficient. Therefore, logistic model #2 is preferred and the quadratic terms belong in the model.
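Since both residual deviances and residual degrees of freedom are printed above, the anova() result can be checked by hand:

```r
# LRT statistic = difference in residual deviances; df = number of added
# parameters (the 5 quadratic terms)
lrt_stat <- 659.75 - 641.42   # reported residual deviances of models #1, #2
df_diff  <- 476 - 471         # reported residual degrees of freedom
pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)   # ~0.0026
```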

table3<- matrix(c(AIC(lg_burn), BIC(lg_burn),
                  AIC(lg_burn2), BIC(lg_burn2)), ncol=2,nrow=2, byrow=TRUE)
colnames(table3)<- c("AIC", "BIC")
rownames(table3)<- c("logistic model #1",
                     "logistic model #2")
table3 <- as.table(table3)
kable(table3)
|                   |      AIC |      BIC |
|-------------------|----------|----------|
| logistic model #1 | 715.7466 | 833.9788 |
| logistic model #2 | 707.4232 | 846.7682 |

Now we have a best linear regression model and a best logistic regression model; which one is better? We compared them on AIC and BIC and concluded that the logistic model has the better fit, since its AIC and BIC are smaller in the table below. (Strictly speaking, AIC and BIC are only comparable between models fit to the same response; the linear model predicts log burned area while the logistic model predicts occurrence, so this comparison is indicative rather than rigorous.) The significant factors from the regression models (X, Y, FFMC, DMC, DC, ISI, temp, RH, wind) are carried into the machine learning prediction below.

table_ll<- matrix(c(AIC(step.back), BIC(step.back),
                  AIC(lg_burn2), BIC(lg_burn2)), ncol=2,nrow=2, byrow=TRUE)
colnames(table_ll)<- c("AIC", "BIC")
rownames(table_ll)<- c("backward elimination linear model",
                     "logistic model #2")
table_ll <- as.table(table_ll)
kable(table_ll)
|                                   |      AIC |      BIC |
|-----------------------------------|----------|----------|
| backward elimination linear model | 899.9630 | 946.1009 |
| logistic model #2                 | 707.4232 | 846.7682 |

Machine Learning Models

Finally, we decided to predict the occurrence of forest fires using machine learning. Based on the results of the regression analysis, we selected the most significant predictors (X, Y, FFMC, DMC, DC, ISI, temp, RH, wind) out of all the variables for the modeling process.

Since we would like to predict a binary outcome, i.e. whether a forest fire occurs, we used classification models: a logistic machine learning model, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), k-nearest neighbors (kNN), a classification tree, and a random forest. In the modeling process, we divided the dataset in half into training and test sets.

Logistic Machine Learning
y <- new_dat$burn
set.seed(1)
train_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- new_dat[train_index, ]
test_set <- new_dat[-train_index, ]

#Logistic Machine Learning
logistic_fit <- glm(burn~X+Y+month+day+FFMC+I(FFMC^2)+DMC+I(DMC^2)+DC+I(DC^2)+ISI+temp+I(temp^2)+RH+wind+rain+I(rain^2), train_set, family=binomial(link = "logit"))
p_hat_logit <- predict(logistic_fit, newdata=test_set)
# note: predict() without type="response" returns log-odds, so this 0.5
# cutoff corresponds to a probability of about 0.62 rather than 0.5
y_hat_logit <- ifelse(p_hat_logit>0.5, 1,0)
confusionMatrix(data=as.factor(y_hat_logit), reference = as.factor(test_set$burn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 72 60
##          1 56 64
##                                          
##                Accuracy : 0.5397         
##                  95% CI : (0.476, 0.6024)
##     No Information Rate : 0.5079         
##     P-Value [Acc > NIR] : 0.1723         
##                                          
##                   Kappa : 0.0787         
##                                          
##  Mcnemar's Test P-Value : 0.7806         
##                                          
##             Sensitivity : 0.5625         
##             Specificity : 0.5161         
##          Pos Pred Value : 0.5455         
##          Neg Pred Value : 0.5333         
##              Prevalence : 0.5079         
##          Detection Rate : 0.2857         
##    Detection Prevalence : 0.5238         
##       Balanced Accuracy : 0.5393         
##                                          
##        'Positive' Class : 0              
## 

QDA, LDA

set.seed(1)

qda_fit <- qda(burn ~ X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data = train_set)
qda_preds <- predict(qda_fit, test_set)
confusionMatrix(data=as.factor(qda_preds$class), reference = as.factor(test_set$burn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 29 27
##          1 99 97
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4366, 0.5634)
##     No Information Rate : 0.5079          
##     P-Value [Acc > NIR] : 0.6237          
##                                           
##                   Kappa : 0.0087          
##                                           
##  Mcnemar's Test P-Value : 2.529e-10       
##                                           
##             Sensitivity : 0.2266          
##             Specificity : 0.7823          
##          Pos Pred Value : 0.5179          
##          Neg Pred Value : 0.4949          
##              Prevalence : 0.5079          
##          Detection Rate : 0.1151          
##    Detection Prevalence : 0.2222          
##       Balanced Accuracy : 0.5044          
##                                           
##        'Positive' Class : 0               
## 
lda_fit <- lda(burn ~ X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data = train_set)
lda_preds <- predict(lda_fit, test_set)
confusionMatrix(data = as.factor(lda_preds$class), reference = as.factor(test_set$burn))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 38 36
##          1 90 88
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.4366, 0.5634)
##     No Information Rate : 0.5079          
##     P-Value [Acc > NIR] : 0.6237          
##                                           
##                   Kappa : 0.0065          
##                                           
##  Mcnemar's Test P-Value : 2.34e-06        
##                                           
##             Sensitivity : 0.2969          
##             Specificity : 0.7097          
##          Pos Pred Value : 0.5135          
##          Neg Pred Value : 0.4944          
##              Prevalence : 0.5079          
##          Detection Rate : 0.1508          
##    Detection Prevalence : 0.2937          
##       Balanced Accuracy : 0.5033          
##                                           
##        'Positive' Class : 0               
## 
p1 <- prediction(p_hat_logit, test_set$burn) %>% performance(measure = "tpr", x.measure="tnr")
p2 <- prediction(qda_preds$posterior[,2], test_set$burn) %>% performance(measure = "tpr", x.measure="tnr")
p3 <- prediction(lda_preds$posterior[,2], test_set$burn) %>% performance(measure = "tpr", x.measure="tnr")

p2 %>% plot(main="QDA ROC",xlim=c(1.5,-0.5))

p3 %>% plot(main="LDA ROC",xlim=c(1.5,-0.5))

# Logistic regression AUC
prediction(p_hat_logit, test_set$burn) %>%
  performance(measure = "auc") %>%
  .@y.values   #0.545
## [[1]]
## [1] 0.5454259
# QDA AUC
prediction(qda_preds$posterior[,2], test_set$burn) %>%
  performance(measure = "auc") %>%
  .@y.values   #0.510
## [[1]]
## [1] 0.5101436
# LDA AUC
prediction(lda_preds$posterior[,2], test_set$burn) %>%
  performance(measure = "auc") %>%
  .@y.values  #0.501
## [[1]]
## [1] 0.5010711

kNN

train_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- new_dat[train_index, ]
test_set <- new_dat[-train_index, ]
train_set$burn <- as.factor(train_set$burn)
test_set$burn <- as.factor(test_set$burn)
# Knn
knn_fit <- knn3(burn~X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data = train_set,k=5)
knn_hat <- predict(knn_fit, test_set, type = "class")
tab <- table(pred=knn_hat, truth=test_set$burn)
confusionMatrix(tab)
## Confusion Matrix and Statistics
## 
##     truth
## pred  0  1
##    0 71 45
##    1 57 79
##                                           
##                Accuracy : 0.5952          
##                  95% CI : (0.5318, 0.6564)
##     No Information Rate : 0.5079          
##     P-Value [Acc > NIR] : 0.003294        
##                                           
##                   Kappa : 0.1915          
##                                           
##  Mcnemar's Test P-Value : 0.276082        
##                                           
##             Sensitivity : 0.5547          
##             Specificity : 0.6371          
##          Pos Pred Value : 0.6121          
##          Neg Pred Value : 0.5809          
##              Prevalence : 0.5079          
##          Detection Rate : 0.2817          
##    Detection Prevalence : 0.4603          
##       Balanced Accuracy : 0.5959          
##                                           
##        'Positive' Class : 0               
## 
knn_prob <- predict(knn_fit, test_set, type="prob")[,2]
plot(performance(prediction(knn_prob, test_set$burn), "tpr", "tnr"),main="KNN ROC", xlim=c(1,0))

prediction(knn_prob, test_set$burn) %>%
  performance(measure = "auc") %>%
  .@y.values  #0.607
## [[1]]
## [1] 0.6066028
plot(p1, col="red",xlim=c(1.5,-0.5), main="Logistic, LDA, QDA, kNN ROC Plot")
plot(p2, add=TRUE, col="blue")
plot(p3, add=TRUE, col="green")
plot(performance(prediction(knn_prob, test_set$burn), "tpr", "tnr"),add=TRUE,col="hot pink")

Machine Learning Summary

Among these four models, kNN with k = 5 has the highest accuracy, the most balanced sensitivity and specificity, and the highest AUC. The logistic model has slightly lower values for these metrics but is still fairly balanced. We therefore chose the kNN and logistic models over the other two.
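In the analysis above, k = 5 was fixed by hand. As a hedged sketch (reusing train_set from the kNN section and caret, which is already in use for createDataPartition; not run here), k could instead be chosen by cross-validation:

```r
# 10-fold cross-validation over odd k from 3 to 25; burn is already a factor
set.seed(1)
knn_cv <- train(burn ~ X + Y + FFMC + DMC + DC + ISI + temp + RH + wind,
                data = train_set, method = "knn",
                tuneGrid = data.frame(k = seq(3, 25, 2)),
                trControl = trainControl(method = "cv", number = 10))
knn_cv$bestTune   # the cross-validated choice of k
```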

Classification tree, Random Forest

## Classification tree, Random Forest
set.seed(1)
# Classification Tree
train_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- new_dat[train_index, ]
test_set <- new_dat[-train_index, ]
train_set$burn <- as.factor(train_set$burn)
test_set$burn <- as.factor(test_set$burn)

tree_fit <- tree(burn ~X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data=train_set)
summary(tree_fit)
## 
## Classification tree:
## tree(formula = burn ~ X + Y + FFMC + DMC + DC + ISI + temp + 
##     RH + wind, data = train_set)
## Variables actually used in tree construction:
## [1] "DMC"  "X"    "RH"   "temp" "wind"
## Number of terminal nodes:  9 
## Residual mean deviance:  1.147 = 278.8 / 243 
## Misclassification error rate: 0.3095 = 78 / 252
pred <- predict(tree_fit, test_set,type = "class")
confusionMatrix(as.factor(pred), reference = test_set$burn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  20  17
##          1 108 107
##                                           
##                Accuracy : 0.504           
##                  95% CI : (0.4405, 0.5673)
##     No Information Rate : 0.5079          
##     P-Value [Acc > NIR] : 0.5751          
##                                           
##                   Kappa : 0.0189          
##                                           
##  Mcnemar's Test P-Value : 8.29e-16        
##                                           
##             Sensitivity : 0.15625         
##             Specificity : 0.86290         
##          Pos Pred Value : 0.54054         
##          Neg Pred Value : 0.49767         
##              Prevalence : 0.50794         
##          Detection Rate : 0.07937         
##    Detection Prevalence : 0.14683         
##       Balanced Accuracy : 0.50958         
##                                           
##        'Positive' Class : 0               
## 
plot(tree_fit, type = "uniform")
text(tree_fit, cex = 1)

cv_tree <- cv.tree(tree_fit)
plot(cv_tree)
best.size <- cv_tree$size[which(cv_tree$dev==min(cv_tree$dev))] # which size is better?
best.size
## [1] 1
prune_fit <- prune.tree(tree_fit, best = 4)
plot(prune_fit, type="uniform")
text(prune_fit)

# Random Forest
rf <- randomForest(burn~X+Y+FFMC+DMC+DC+ISI+temp+RH+wind, data = train_set)
rf
## 
## Call:
##  randomForest(formula = burn ~ X + Y + FFMC + DMC + DC + ISI +      temp + RH + wind, data = train_set) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 39.29%
## Confusion matrix:
##    0  1 class.error
## 0 58 61   0.5126050
## 1 38 95   0.2857143
preds_rf <- predict(rf, test_set)
confusionMatrix(as.factor(preds_rf), reference = test_set$burn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 56 48
##          1 72 76
##                                           
##                Accuracy : 0.5238          
##                  95% CI : (0.4602, 0.5869)
##     No Information Rate : 0.5079          
##     P-Value [Acc > NIR] : 0.32974         
##                                           
##                   Kappa : 0.0503          
##                                           
##  Mcnemar's Test P-Value : 0.03576         
##                                           
##             Sensitivity : 0.4375          
##             Specificity : 0.6129          
##          Pos Pred Value : 0.5385          
##          Neg Pred Value : 0.5135          
##              Prevalence : 0.5079          
##          Detection Rate : 0.2222          
##    Detection Prevalence : 0.4127          
##       Balanced Accuracy : 0.5252          
##                                           
##        'Positive' Class : 0               
## 
varImpPlot(rf,main = "Random Forest Important Variable Plot")

Cross-validation with cv.tree actually selected a tree of size 1 (i.e., no split improved the deviance), which itself suggests weak signal; we nevertheless pruned the tree to 4 terminal nodes for interpretability. For this model, the accuracy is 0.504, while the sensitivity (0.1563) and specificity (0.8629) are extremely unbalanced. We would not consider it a good model for predicting forest fires.
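One possible refinement, sketched here under the assumption that tree_fit is the classification tree fitted above (not run): for a classification outcome, cross-validating on the misclassification rate rather than the default deviance is often more appropriate:

```r
# cv.tree() with FUN = prune.misclass cross-validates on misclassification
# counts; $dev then holds error counts rather than deviance
set.seed(1)
cv_miss <- cv.tree(tree_fit, FUN = prune.misclass)
best_k  <- cv_miss$size[which.min(cv_miss$dev)]
pruned  <- prune.misclass(tree_fit, best = best_k)
```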

For the random forest model, the accuracy is 0.5238, slightly lower than for the logistic and kNN models. The sensitivity (0.4375) and specificity (0.6129) are relatively balanced. Thus, this model is also a reasonable choice for classifying forest fire occurrence.
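The random forest above used the default mtry = 3 (the number of variables tried at each split). A hedged sketch (assuming train_set as above; not run here) of tuning mtry with tuneRF() from the randomForest package:

```r
# search over mtry values, stepping by a factor of 1.5, keeping a candidate
# only if it improves the OOB error estimate by at least 1%
set.seed(1)
vars  <- c("X", "Y", "FFMC", "DMC", "DC", "ISI", "temp", "RH", "wind")
tuned <- tuneRF(x = train_set[, vars], y = train_set$burn,
                stepFactor = 1.5, improve = 0.01, trace = FALSE)
```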

Final Analysis

  • What did you learn about the data? How did you answer the questions? How can you justify your answers?

From the Shiny app and plots, forest fires are most common during August and September, and occur most frequently at temperatures around 20 °C and relative humidity around 40%. The distribution of wind speed is more spread out. Since neither the frequency nor the burned area of forest fires varies much between weekdays and weekends, human activity does not appear to have a great impact on forest fires.

Although the number of forest fire occurrences is not linearly associated with the weather observations, large-scale forest fires are more likely to occur under high temperature and low humidity.

For the regression analysis, the backward elimination (AIC-based) model is the best of the linear regression models, and the second logistic model is the better of the two logistic models. Comparing these two (table below), we concluded that the second logistic regression model fits best.

tablefinal<- matrix(c(AIC(step.back), BIC(step.back),
                  AIC(lg_burn2), BIC(lg_burn2)), ncol=2,nrow=2, byrow=TRUE)
colnames(tablefinal)<- c("AIC", "BIC")
rownames(tablefinal)<- c("linear #4: model with backward elimination based on AIC",
                     "logistic model #2")
tablefinal <- as.table(tablefinal)
kable(tablefinal)
|                                                         |      AIC |      BIC |
|---------------------------------------------------------|----------|----------|
| linear #4: model with backward elimination based on AIC | 899.9630 | 946.1009 |
| logistic model #2                                       | 707.4232 | 846.7682 |

By comparing the P-values, we ranked the most significant variables as following: X, Y, FFMC, DMC, DC, ISI, temp, RH, wind.

For the machine learning models, we consider the kNN, logistic, and random forest models the best predictors of forest fires, because they have the highest accuracy and the most balanced specificity and sensitivity.